Lexicon-directed Segmentation and Tagging of Sanskrit
Abstract
We propose a methodology for Sanskrit processing by computer. The first layer of this software, which analyses the linear structure of a Sanskrit sentence as a set of possible interpretations under sandhi analysis, is operational. Each interpretation proposes a segmentation of the sentence as a list of tagged segments. The method, which is lexicon-directed, is complete if the given lexicon of stem forms is complete for the target corpus. It uses an original design for a finite-state transducer toolkit, based on functional programming principles. Further layers of this computational linguistics architecture are discussed.
1 COMPUTATIONAL LINGUISTICS AND SANSKRIT
Descriptive linguistics is the study of natural language phenomena. Theoretical linguistics strives to provide formal models of linguistic activity. Noam Chomsky initiated modern theoretical linguistics with successive theories (context-free and transformational grammars, government and binding, minimalism). These theories usually provide a generative grammar paradigm, by means of which any valid sentence in the language may be generated. With the advent of computers, a still more formal approach is attempted, in which executable programs process a digital representation of natural language into information structures that provide some degree of understanding of the given text or discourse. In the pioneering days of the 1950s, automated translation systems were thus attempted, although the difficulties were largely underestimated, leading to doubts about the feasibility of natural language understanding by computers. Fifty years later, however, the situation
Related resources
An Effort to Develop a Tagged Lexical Resource for Sanskrit
In this paper we present our efforts, the first of their kind in the history of Sanskrit, to design and develop a structured electronic lexical resource by tagging a traditional Sanskrit dictionary. We describe how the entire unstructured raw text of Vaacaspatyam, an encyclopedic Sanskrit dictionary, has been tagged to form a user-friendly e-lexicon with structured and segregated informa...
A functional toolkit for morphological and phonological processing, application to a Sanskrit tagger
We present the Zen toolkit for morphological and phonological processing of natural languages. This toolkit is presented in literate programming style, in the Pidgin ML subset of the Objective Caml functional programming language. This toolkit is based on a systematic representation of finite state automata and transducers as decorated lexical trees. All operations on the state space data struc...
Effective Subsequence-based Tagging for Chinese Word Segmentation
Hai Zhao, Chunyu Kit (Department of Chinese, Translation and Linguistics, City University of Hong Kong, 83 Tat Avenue, Kowloon, Hong Kong SAR, China). Abstract: Research on automatic Chinese word segmentation has been advancing rapidly in recent years, especially since the First International Chinese Word Segmentation Bakeo...
Joint Arabic Segmentation and Part-Of-Speech Tagging
Arabic has a very complex morphological system, though a very structured one. Character patterns are often indicative of word class and word segmentation. In this paper, we explore a novel approach to Arabic word segmentation and part-of-speech tagging relying on character information. The approach is lexicon-free and does not require any morphological analysis, eliminating the factor of dict...
Formal Structure of Sanskrit Text: Requirements Analysis for a Mechanical Sanskrit Processor
We discuss the mathematical structure of various levels of representation of Sanskrit text in order to guide the design of computer aids aiming at useful processing of the digitalised Sanskrit corpus. Two main levels are identified, respectively called the linear and functional level. The design space of these two levels is sketched, and the computational implications of the main design choices...